A Compositional Framework for Grounding Language Inference, Generation, and Acquisition in Video

نویسندگان

  • Haonan Yu
  • N. Siddharth
  • Andrei Barbu
  • Jeffrey Mark Siskind
چکیده

We present an approach to simultaneously reasoning about a video clip and an entire natural-language sentence. The compositional nature of language is exploited to construct models which represent the meanings of entire sentences composed out of the meanings of the words in those sentences mediated by a grammar that encodes the predicate-argument relations. We demonstrate that these models faithfully represent the meanings of sentences and are sensitive to how the roles played by participants (nouns), their characteristics (adjectives), the actions performed (verbs), the manner of such actions (adverbs), and changing spatial relations between participants (prepositions) affect the meaning of a sentence and how it is grounded in video. We exploit this methodology in three ways. In the first, a video clip along with a sentence are taken as input and the participants in the event described by the sentence are highlighted, even when the clip depicts multiple similar simultaneous events. In the second, a video clip is taken as input without a sentence and a sentence is generated that describes an event in that clip. In the third, a corpus of video clips is paired with sentences which describe some of the events in those clips and the meanings of the words in those sentences are learned. We learn these meanings without needing to specify which attribute of the video clips each word in a given sentence refers to. The learned meaning representations are shown to be intelligible to humans.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework

Recently, joint video-language modeling has been attracting more and more attention. However, most existing approaches focus on exploring the language model upon on a fixed visual model. In this paper, we propose a unified framework that jointly models video and the corresponding text sentences. The framework consists of three parts: a compositional semantics language model, a deep video model ...

متن کامل

Grounding Compositional Hypothesis Generation in Specific Instances

A number of recent computational models treat concept learning as a form of probabilistic rule induction in a space of language-like, compositional concepts. Inference in such models frequently requires repeatedly sampling from a (infinite) distribution over possible concept rules and comparing their relative likelihood in light of current data or evidence. However, we argue that most existing ...

متن کامل

Approaching the Symbol Grounding Problem with Probabilistic Graphical Models

In order for robots to engage in dialog with human teammates, they must have the ability to identify correspondences between elements of language and aspects of the external world. A solution to this symbol grounding problem (Harnad, 1990) would enable a robot to interpret commands such as “Drive over to receiving and pick up the tire pallet.” This article describes several of our results that ...

متن کامل

Natural Language Acquisition and Grounding for Embodied Robotic Systems

We present a cognitively plausible novel framework capable of learning the grounding in visual semantics and the grammar of natural language commands given to a robot in a table top environment. The input to the system consists of video clips of a manually controlled robot arm, paired with natural language commands describing the action. No prior knowledge is assumed about the meaning of words,...

متن کامل

Interpreting the Validity of a High-Stakes Test in Light of the Argument-Based Framework: Implications for Test Improvement

The validity of large-scale assessments may be compromised, partly due to their content inappropriateness or construct underrepresentation. Few validity studies have focused on such assessments within an argument-based framework. This study analyzed the domain description and evaluation inference of the Ph.D. Entrance Exam of ELT (PEEE) sat by Ph.D. examinees (n = 999) in 2014 in Iran....

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • J. Artif. Intell. Res.

دوره 52  شماره 

صفحات  -

تاریخ انتشار 2015